Final Assignment

Prediction of Property Saling Price in New York City using sklearn

Introduction: to understand the impacts of the selected variables on housing prices in New York City, we first develop a model trained by the historical data of property sale price data in 2020. We include eight independent variables drawn from three different datasets, including total population, total housing units, people in labor force, travel time, median household income, total household, offense description for the area, building class category. To develop the model using historical data, we first split the 2020 sale data into the train and test datasets. We then use GridSearchCV to perform a k-fold cross validation that optimize at least 2 hyperparameters of the Random Forest Regressor.

Writter:

Yuanhao Zhai, Kaye Li

Requirement meet:

  • Data is collected through a means more sophisticated than downloading (e.g. scraping, API)
  • It combines data collected from 3 or more different sources.
  • The analysis of the data is reasonably complex, involving multiple steps (geospatial joins/operations, data shaping, data frame operations, etc).
  • You use an osmnx or pandana to perform an analysis of street network data
  • You use scikit-learn to perform a clustering analysis.
  • You perform a machine learning analysis with scikit-learn as part of the analysis.
  • The project includes multiple interactive visualizations that include a significant interactive component (cross-filtering, interactive widgets, etc)
Code
#import the essential packages
#base packages
import numpy as np
import pandas as pd
import geopandas as gpd

# Plotting packages
import seaborn as sns
from matplotlib import pyplot as plt
import holoviews as hv
import hvplot.pandas

# Sodapy API packages
import requests
from sodapy import Socrata

# Set a Random Seed
np.random.seed(42)
pd.options.display.max_columns = 999

Part 1: Collecting Data using API

Data Source:

NYC Citywide Annualized Calendar Sales Update

NYC 2020 census tract

NYPD Arrest Data (Year to Date)

1.1 Get the Property Saling data

To get the Property Saling Data from the NYC OpenData, We use the Socrata API to link the database.

Code
client = Socrata("data.cityofnewyork.us",
                  "6xHCd7htGtLFFnGXy6R9LdnXB",
                  username="zyuanhao@upenn.edu",
                  password="ZYHtt-0325")

Get the property saling price data from the NYC Open Data using Socrata API.

property_sales = client.get_all("w2pb-icbu")

# Convert to pandas DataFrame
prosal_df = pd.DataFrame.from_records(property_sales)
# clean the raw saling price data, drop all NA data
prosal_clean = prosal_df.drop(columns=['apartment_number','census_tract_2020','nta_code']).dropna()

# convert the str data to float64 data
prosal_clean['sale_price'] = pd.to_numeric(prosal_clean['sale_price'], errors='coerce')
prosal_clean['gross_square_feet'] = prosal_clean['gross_square_feet'].replace(',', '', regex=True)
prosal_clean['gross_square_feet'] = pd.to_numeric(prosal_clean['gross_square_feet'], errors='coerce')

# drop the extreme data(sale price>10000, gross square feet<30)
prosal_filtered = prosal_clean[(prosal_clean['sale_price'] > 10000) & (prosal_clean['gross_square_feet'] > 30)]

# convert df data to geodataframe with crs: 4326
prosal_clean_geo = gpd.GeoDataFrame(
    prosal_filtered, geometry=gpd.points_from_xy(prosal_filtered.longitude, prosal_filtered.latitude), crs="EPSG:4326"
)

To better compare the property price we have to get the sale price per square feet.

prosal_clean_geo['unit_sale_price'] = prosal_clean_geo['sale_price'] / prosal_clean_geo['gross_square_feet']
prosal_clean_geo = prosal_clean_geo.dropna()
prosal_clean_geo.head()

#clean the extrame data
Q1 =  prosal_clean_geo['unit_sale_price'].quantile(0.25)
Q3 =  prosal_clean_geo['unit_sale_price'].quantile(0.75)
IQR = Q3 - Q1

lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
prosal_last_geo = prosal_clean_geo[(prosal_clean_geo['unit_sale_price'] > lower) & (prosal_clean_geo['unit_sale_price'] < upper)]
prosal_last_geo.shape
(162890, 29)

Now, Let’s draw a box plot to explore the data

Code
# draw a box plot to explore the data
plt.boxplot(prosal_last_geo['unit_sale_price'])
plt.show()

From the plot above, we can see that average price of property is around $400.

And we can also draw a hitogram plot to see the distribution of the data.

Code
plt.hist(prosal_last_geo['unit_sale_price'], bins=25, edgecolor='black')

plt.title('Property Price Per Square Feet')
plt.xlabel('price per square feet')
plt.ylabel('count')

plt.show()

Most of the price is between $200 to $600.

1.2 Get the Public Safety Data

To further explain what fator contribute to the property price, we will collect the public safety data by using the NYPD Arrest data, this is a data with the catogory of crime and exact geographic position of crimes.

# get the NYPD Arrest Data (Year to Date) 
NYPD_arrest = client.get_all("uip8-fykc")

# Convert to pandas DataFrame
public_safe = pd.DataFrame.from_records(NYPD_arrest)
# Convert the dataframe to Geodataframe and drop the NA data.
safe_geo = gpd.GeoDataFrame(
    public_safe, geometry=gpd.points_from_xy(public_safe.longitude, public_safe.latitude), crs="EPSG:4326"
).drop(columns=['geocoded_column']).dropna()
safe_geo.head()
arrest_key arrest_date pd_cd pd_desc ky_cd ofns_desc law_code law_cat_cd arrest_boro arrest_precinct jurisdiction_code age_group perp_sex perp_race x_coord_cd y_coord_cd latitude longitude :@computed_region_f5dn_yrer :@computed_region_yeji_bk3q :@computed_region_92fq_4b7q :@computed_region_sbqj_enih :@computed_region_efsh_h5xi geometry
0 261209118 2023-01-01T00:00:00.000 109 ASSAULT 2,1,UNCLASSIFIED 106 FELONY ASSAULT PL 1200501 F K 77 0 45-64 F BLACK 999335 186085 40.677426 -73.945615 16 2 49 49 17618 POINT (-73.94562 40.67743)
1 262984267 2023-02-03T00:00:00.000 515 CONTROLLED SUBSTANCE,SALE 3 117 DANGEROUS DRUGS PL 2203901 F K 73 0 25-44 M BLACK 1009318 178259 40.655923 -73.90965 55 2 25 46 17614 POINT (-73.90965 40.65592)
2 263664549 2023-02-15T00:00:00.000 105 STRANGULATION 1ST 106 FELONY ASSAULT PL 1211200 F K 62 0 25-44 M WHITE 982272 158771 40.602468 -74.00712 1 2 44 37 17616 POINT (-74.00712 40.60247)
3 261345231 2023-01-04T00:00:00.000 105 STRANGULATION 1ST 106 FELONY ASSAULT PL 1211200 F M 32 0 25-44 M BLACK 999899 238684 40.821797 -73.943457 18 4 36 20 12427 POINT (-73.94346 40.82180)
4 263536618 2023-02-13T00:00:00.000 109 ASSAULT 2,1,UNCLASSIFIED 106 FELONY ASSAULT PL 12005WX F K 71 0 25-44 M BLACK 1001437 183080 40.669175 -73.938042 16 2 48 49 17615 POINT (-73.93804 40.66918)
Code
plt.figure(figsize=(5,10))
plt.hist(safe_geo['ofns_desc'],  orientation='horizontal', edgecolor='black')
(array([6.0572e+04, 4.7617e+04, 1.4415e+04, 1.5298e+04, 1.8410e+04,
        8.8820e+03, 2.5980e+03, 3.5700e+02, 2.2800e+02, 5.3000e+01]),
 array([ 0. ,  6.1, 12.2, 18.3, 24.4, 30.5, 36.6, 42.7, 48.8, 54.9, 61. ]),
 <BarContainer object of 10 artists>)

1.3 Get the census tract data

And besides the public safety, We assume that the population, household income, labor force, traffic time will also have effect on sale price, so we decide to get the data relate to those topic.

# get the census tract data available
import cenpy

available = cenpy.explorer.available()
C:\Users\zhaiy\mambaforge\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:39: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def nb_dist(x, y):
C:\Users\zhaiy\mambaforge\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:165: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def get_faces(triangle):
C:\Users\zhaiy\mambaforge\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:199: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def build_faces(faces, triangles_is, num_triangles, num_faces_single):
C:\Users\zhaiy\mambaforge\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:261: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def nb_mask_faces(mask, faces):
#we will use the data from 2020, by American Community Survey estimates
acs = cenpy.remote.APIConnection("ACSDP5Y2020")

#explore the data profile included in this database 
acs.variables.head()
label concept predicateType group limit predicateOnly hasGeoCollectionSupport attributes required
for Census API FIPS 'for' clause Census API Geography Specification fips-for N/A 0 True NaN NaN NaN
in Census API FIPS 'in' clause Census API Geography Specification fips-in N/A 0 True NaN NaN NaN
ucgid Uniform Census Geography Identifier clause Census API Geography Specification ucgid N/A 0 True True NaN NaN
DP02_0126E Estimate!!ANCESTRY!!Total population!!Arab SELECTED SOCIAL CHARACTERISTICS IN THE UNITED ... int DP02 0 NaN NaN DP02_0126EA,DP02_0126M,DP02_0126MA NaN
DP05_0050PE Percent!!RACE!!Total population!!One race!!Asi... ACS DEMOGRAPHIC AND HOUSING ESTIMATES float DP05 0 NaN NaN DP05_0050PEA,DP05_0050PM,DP05_0050PMA NaN
Code
pd.options.display.max_rows = 9999 
pd.options.display.max_colwidth = 200
variables = [
   "NAME",
   "DP05_0001E", #Total population
   "DP05_0086E", #Total Housing Units
   "DP03_0002E", #In Labor Force
   "DP03_0025E", #Mean travel time to work (minutes)
   "DP03_0062E", #Median household income (dollars)
   "DP02_0001E", #Total households
]

Tips

New York City is composed of 5 boroughs, with each borough also its own county. Manhattan is in New York County, Brooklyn is in Kings County, Queens is in Queens County, the Bronx is in Bronx County, and Staten Island is in Richmond County. The counties and boroughs of New York City are coterminous.

# First, get the NY County Data.
nyc_demo_data = acs.query(
    cols=variables,
    geo_unit="tract:*",
    geo_filter={"state": "36", "county": "061"},
)
# Then, Get the data from 4 other Counties
counties = [('36', '005'),('36','047'),('36','081'),('36','085')]  # Replace with actual FIPS codes

for state, county in counties:
    new_data = acs.query(cols=variables, geo_unit='tract', geo_filter={'state': state, 'county': county})
    nyc_demo_data = nyc_demo_data.append(new_data, ignore_index=True)
C:\Users\zhaiy\AppData\Local\Temp\ipykernel_8424\2929721321.py:6: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  nyc_demo_data = nyc_demo_data.append(new_data, ignore_index=True)
C:\Users\zhaiy\AppData\Local\Temp\ipykernel_8424\2929721321.py:6: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  nyc_demo_data = nyc_demo_data.append(new_data, ignore_index=True)
C:\Users\zhaiy\AppData\Local\Temp\ipykernel_8424\2929721321.py:6: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  nyc_demo_data = nyc_demo_data.append(new_data, ignore_index=True)
C:\Users\zhaiy\AppData\Local\Temp\ipykernel_8424\2929721321.py:6: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  nyc_demo_data = nyc_demo_data.append(new_data, ignore_index=True)
# Rename the data to our topic
nyc_census = nyc_demo_data.rename(columns={"DP05_0001E": "Total_Population",
                              "DP05_0086E": "Total Housing Units",
                              "DP03_0002E":"In Labor Force",
                              "DP03_0025E":"Travel_Time", 
                              "DP03_0062E":"Median_HH_Income",
                              "DP02_0001E":"Total_HH"})

nyc_census['GEOID'] = nyc_census['state'] + nyc_census['county'] + nyc_census['tract']
nyc_census.head()
NAME Total_Population Total Housing Units In Labor Force Travel_Time Median_HH_Income Total_HH state county tract GEOID
0 Census Tract 165, New York County, New York 6674 3841 3954 32.9 184691 3176 36 061 016500 36061016500
1 Census Tract 166, New York County, New York 6002 3279 2953 33.0 47778 2796 36 061 016600 36061016600
2 Census Tract 167, New York County, New York 6058 3804 3616 30.6 203711 2969 36 061 016700 36061016700
3 Census Tract 168, New York County, New York 5189 2102 1561 44.3 27222 1774 36 061 016800 36061016800
4 Census Tract 169, New York County, New York 8272 5016 5033 31.0 131097 3949 36 061 016900 36061016900
len(nyc_census) # We have 2327 Census Tracts in total.
2327
Code
# Then we will get the polygon of our census tracts in order to merge data with values
import pygris
nyc_tracts = pygris.tracts(
    state="36", county="061", year=2020
)

for state, county in counties:
    new_tracts = pygris.tracts(
    state=state, county=county, year=2020
)
    nyc_tracts = nyc_tracts.append(new_tracts, ignore_index=True)
C:\Users\zhaiy\AppData\Local\Temp\ipykernel_8424\405300660.py:9: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  nyc_tracts = nyc_tracts.append(new_tracts, ignore_index=True)
C:\Users\zhaiy\AppData\Local\Temp\ipykernel_8424\405300660.py:9: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  nyc_tracts = nyc_tracts.append(new_tracts, ignore_index=True)
C:\Users\zhaiy\AppData\Local\Temp\ipykernel_8424\405300660.py:9: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  nyc_tracts = nyc_tracts.append(new_tracts, ignore_index=True)
C:\Users\zhaiy\AppData\Local\Temp\ipykernel_8424\405300660.py:9: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  nyc_tracts = nyc_tracts.append(new_tracts, ignore_index=True)
nyc_tracts.head()
STATEFP COUNTYFP TRACTCE GEOID NAME NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON geometry
0 36 061 001501 36061001501 15.01 Census Tract 15.01 G5020 S 270812 166085 +40.7075653 -074.0013991 POLYGON ((-74.00860 40.71139, -74.00835 40.71136, -74.00797 40.71145, -74.00809 40.71146, -74.00815 40.71148, -74.00820 40.71149, -74.00822 40.71152, -74.00823 40.71155, -74.00822 40.71158, -74.00...
1 36 061 001600 36061001600 16 Census Tract 16 G5020 S 207381 0 +40.7159568 -073.9932660 POLYGON ((-73.99750 40.71407, -73.99709 40.71462, -73.99681 40.71504, -73.99653 40.71547, -73.99606 40.71623, -73.99543 40.71728, -73.99481 40.71846, -73.99383 40.71814, -73.99309 40.71792, -73.99...
2 36 061 001800 36061001800 18 Census Tract 18 G5020 S 222964 0 +40.7190464 -073.9908407 POLYGON ((-73.99442 40.71939, -73.99438 40.71952, -73.99426 40.71978, -73.99403 40.72032, -73.99379 40.72094, -73.99367 40.72126, -73.99352 40.72163, -73.99273 40.72140, -73.99224 40.72125, -73.99...
3 36 061 002000 36061002000 20 Census Tract 20 G5020 S 126280 137883 +40.7206299 -073.9750259 POLYGON ((-73.97872 40.71998, -73.97823 40.72067, -73.97802 40.72097, -73.97779 40.72128, -73.97768 40.72142, -73.97736 40.72186, -73.97735 40.72189, -73.97689 40.72250, -73.97616 40.72219, -73.97...
4 36 061 002201 36061002201 22.01 Census Tract 22.01 G5020 S 161668 0 +40.7191156 -073.9818443 POLYGON ((-73.98448 40.72023, -73.98382 40.72147, -73.98300 40.72122, -73.98217 40.72097, -73.98131 40.72071, -73.97973 40.72023, -73.97875 40.71993, -73.97904 40.71939, -73.97942 40.71867, -73.97...

Part 2: Spatial join the data to the census tract.

And take a look at the median housing price of different census tract.

merged_ct = pd.merge(nyc_tracts, nyc_census, on='GEOID', how='left')
merged_ct.head()
STATEFP COUNTYFP TRACTCE GEOID NAME_x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON geometry NAME_y Total_Population Total Housing Units In Labor Force Travel_Time Median_HH_Income Total_HH state county tract
0 36 061 001501 36061001501 15.01 Census Tract 15.01 G5020 S 270812 166085 +40.7075653 -074.0013991 POLYGON ((-74.00860 40.71139, -74.00835 40.71136, -74.00797 40.71145, -74.00809 40.71146, -74.00815 40.71148, -74.00820 40.71149, -74.00822 40.71152, -74.00823 40.71155, -74.00822 40.71158, -74.00... Census Tract 15.01, New York County, New York 8021 4266 4352 25.5 103102 3720 36 061 001501
1 36 061 001600 36061001600 16 Census Tract 16 G5020 S 207381 0 +40.7159568 -073.9932660 POLYGON ((-73.99750 40.71407, -73.99709 40.71462, -73.99681 40.71504, -73.99653 40.71547, -73.99606 40.71623, -73.99543 40.71728, -73.99481 40.71846, -73.99383 40.71814, -73.99309 40.71792, -73.99... Census Tract 16, New York County, New York 7504 3448 4271 31.0 60975 3147 36 061 001600
2 36 061 001800 36061001800 18 Census Tract 18 G5020 S 222964 0 +40.7190464 -073.9908407 POLYGON ((-73.99442 40.71939, -73.99438 40.71952, -73.99426 40.71978, -73.99403 40.72032, -73.99379 40.72094, -73.99367 40.72126, -73.99352 40.72163, -73.99273 40.72140, -73.99224 40.72125, -73.99... Census Tract 18, New York County, New York 7101 3383 3800 28.0 51480 3095 36 061 001800
3 36 061 002000 36061002000 20 Census Tract 20 G5020 S 126280 137883 +40.7206299 -073.9750259 POLYGON ((-73.97872 40.71998, -73.97823 40.72067, -73.97802 40.72097, -73.97779 40.72128, -73.97768 40.72142, -73.97736 40.72186, -73.97735 40.72189, -73.97689 40.72250, -73.97616 40.72219, -73.97... Census Tract 20, New York County, New York 4606 2040 1302 37.7 18750 1964 36 061 002000
4 36 061 002201 36061002201 22.01 Census Tract 22.01 G5020 S 161668 0 +40.7191156 -073.9818443 POLYGON ((-73.98448 40.72023, -73.98382 40.72147, -73.98300 40.72122, -73.98217 40.72097, -73.98131 40.72071, -73.97973 40.72023, -73.97875 40.71993, -73.97904 40.71939, -73.97942 40.71867, -73.97... Census Tract 22.01, New York County, New York 6372 3165 2759 35.7 25188 2988 36 061 002201
len(merged_ct)
2327

Explore the data by using a interactive map!

Code
merged_ct.explore("Median_HH_Income")
Make this Notebook Trusted to load map: File -> Trust Notebook
Code
merged_ct = merged_ct.to_crs(epsg='4326')
safe_merged = gpd.sjoin(merged_ct, safe_geo, how='left', op='intersects')
safe_merged.head()
C:\Users\zhaiy\mambaforge\envs\musa-550-fall-2023\lib\site-packages\IPython\core\interactiveshell.py:3448: FutureWarning: The `op` parameter is deprecated and will be removed in a future release. Please use the `predicate` parameter instead.
  if await self.run_code(code, result, async_=asy):
STATEFP COUNTYFP TRACTCE GEOID NAME_x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON geometry NAME_y Total_Population Total Housing Units In Labor Force Travel_Time Median_HH_Income Total_HH state county tract index_right arrest_key arrest_date pd_cd pd_desc ky_cd ofns_desc law_code law_cat_cd arrest_boro arrest_precinct jurisdiction_code age_group perp_sex perp_race x_coord_cd y_coord_cd latitude longitude :@computed_region_f5dn_yrer :@computed_region_yeji_bk3q :@computed_region_92fq_4b7q :@computed_region_sbqj_enih :@computed_region_efsh_h5xi
0 36 061 001501 36061001501 15.01 Census Tract 15.01 G5020 S 270812 166085 +40.7075653 -074.0013991 POLYGON ((-74.00860 40.71139, -74.00835 40.71136, -74.00797 40.71145, -74.00809 40.71146, -74.00815 40.71148, -74.00820 40.71149, -74.00822 40.71152, -74.00823 40.71155, -74.00822 40.71158, -74.00... Census Tract 15.01, New York County, New York 8021 4266 4352 25.5 103102 3720 36 061 001501 114437.0 273573301 2023-08-28T00:00:00.000 494 STOLEN PROPERTY 2,1,POSSESSION 111 POSSESSION OF STOLEN PROPERTY PL 1654501 F M 1 0 25-44 M BLACK HISPANIC 982722 197618 40.70909234324655 -74.0055114103435 56 4 32 1 13096
0 36 061 001501 36061001501 15.01 Census Tract 15.01 G5020 S 270812 166085 +40.7075653 -074.0013991 POLYGON ((-74.00860 40.71139, -74.00835 40.71136, -74.00797 40.71145, -74.00809 40.71146, -74.00815 40.71148, -74.00820 40.71149, -74.00822 40.71152, -74.00823 40.71155, -74.00822 40.71158, -74.00... Census Tract 15.01, New York County, New York 8021 4266 4352 25.5 103102 3720 36 061 001501 71325.0 268056743 2023-05-10T00:00:00.000 109 ASSAULT 2,1,UNCLASSIFIED 106 FELONY ASSAULT PL 1200501 F M 1 0 65+ F ASIAN / PACIFIC ISLANDER 982593 197702 40.709325 -74.005974 56 4 32 1 13096
0 36 061 001501 36061001501 15.01 Census Tract 15.01 G5020 S 270812 166085 +40.7075653 -074.0013991 POLYGON ((-74.00860 40.71139, -74.00835 40.71136, -74.00797 40.71145, -74.00809 40.71146, -74.00815 40.71148, -74.00820 40.71149, -74.00822 40.71152, -74.00823 40.71155, -74.00822 40.71158, -74.00... Census Tract 15.01, New York County, New York 8021 4266 4352 25.5 103102 3720 36 061 001501 60714.0 263536626 2023-02-13T00:00:00.000 339 LARCENY,PETIT FROM OPEN AREAS, 341 PETIT LARCENY PL 1552500 M M 1 0 25-44 F WHITE 982593 197702 40.709325 -74.005974 56 4 32 1 13096
0 36 061 001501 36061001501 15.01 Census Tract 15.01 G5020 S 270812 166085 +40.7075653 -074.0013991 POLYGON ((-74.00860 40.71139, -74.00835 40.71136, -74.00797 40.71145, -74.00809 40.71146, -74.00815 40.71148, -74.00820 40.71149, -74.00822 40.71152, -74.00823 40.71155, -74.00822 40.71158, -74.00... Census Tract 15.01, New York County, New York 8021 4266 4352 25.5 103102 3720 36 061 001501 6811.0 266496509 2023-04-11T00:00:00.000 109 ASSAULT 2,1,UNCLASSIFIED 106 FELONY ASSAULT PL 1200502 F M 1 0 25-44 M BLACK 982380 197844 40.709715 -74.006744 56 4 32 1 13096
0 36 061 001501 36061001501 15.01 Census Tract 15.01 G5020 S 270812 166085 +40.7075653 -074.0013991 POLYGON ((-74.00860 40.71139, -74.00835 40.71136, -74.00797 40.71145, -74.00809 40.71146, -74.00815 40.71148, -74.00820 40.71149, -74.00822 40.71152, -74.00823 40.71155, -74.00822 40.71158, -74.00... Census Tract 15.01, New York County, New York 8021 4266 4352 25.5 103102 3720 36 061 001501 69172.0 271435118 2023-07-17T00:00:00.000 339 LARCENY,PETIT FROM OPEN AREAS, 341 PETIT LARCENY PL 1552500 M M 1 0 25-44 M BLACK HISPANIC 982380 197844 40.709715 -74.006744 56 4 32 1 13096
count_values = safe_merged.groupby('GEOID')['ofns_desc'].count()

merged_ct = merged_ct.merge(count_values, left_on='GEOID', right_index=True)
sales_census = gpd.sjoin(prosal_last_geo, merged_ct, how='left', op='intersects').drop(columns = ['index_right'])
sales_census.head()
C:\Users\zhaiy\mambaforge\envs\musa-550-fall-2023\lib\site-packages\IPython\core\interactiveshell.py:3448: FutureWarning: The `op` parameter is deprecated and will be removed in a future release. Please use the `predicate` parameter instead.
  if await self.run_code(code, result, async_=asy):
borough neighborhood building_class_category tax_class_as_of_final_roll block lot building_class_as_of_final address zip_code residential_units commercial_units total_units land_square_feet gross_square_feet year_built tax_class_at_time_of_sale building_class_at_time_of sale_price sale_date latitude longitude community_board council_district census_tract bin bbl nta geometry unit_sale_price STATEFP COUNTYFP TRACTCE GEOID NAME_x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON NAME_y Total_Population Total Housing Units In Labor Force Travel_Time Median_HH_Income Total_HH state county tract ofns_desc
5 1 CHELSEA 21 OFFICE BUILDINGS 4 802 75 O6 158 WEST 27 STREET 10001 0 14 14 8,305 108000.0 1913 4 O6 99350000.0 2019-10-24T00:00:00.000 40.746089 -73.992576 105 3 95 1015055 1008020075 Midtown-Midtown South POINT (-73.99258 40.74609) 919.907407 36 061 009500 36061009500 95 Census Tract 95 G5020 S 171964.0 0.0 +40.7472363 -073.9933578 Census Tract 95, New York County, New York 2985 1393 1547 22.7 136944 1148 36 061 009500 101.0
6 1 CHELSEA 21 OFFICE BUILDINGS 4 803 4 O4 307 7 AVENUE 10001 0 194 194 10,225 197612.0 1926 4 O4 115000000.0 2019-10-17T00:00:00.000 40.746869 -73.993616 105 3 95 1015061 1008030004 Midtown-Midtown South POINT (-73.99362 40.74687) 581.948465 36 061 009500 36061009500 95 Census Tract 95 G5020 S 171964.0 0.0 +40.7472363 -073.9933578 Census Tract 95, New York County, New York 2985 1393 1547 22.7 136944 1148 36 061 009500 101.0
8 1 CHELSEA 22 STORE BUILDINGS 4 772 72 K7 250 WEST 23RD STREET 10011 0 1 1 4,938 15716.0 1948 4 K7 14500000.0 2019-09-05T00:00:00.000 40.744698 -73.997091 104 3 91 1014135 1007720072 Hudson Yards-Chelsea-Flatiron-Union Square POINT (-73.99709 40.74470) 922.626623 36 061 009100 36061009100 91 Census Tract 91 G5020 S 179098.0 0.0 +40.7447178 -073.9951959 Census Tract 91, New York County, New York 6046 4145 4450 22.3 152946 3524 36 061 009100 118.0
11 1 CHELSEA 23 LOFT BUILDINGS 4 799 67 L1 148 WEST 24TH STREET, 4 FL 10011 0 15 15 4,937 55923.0 1910 4 L1 4500000.0 2019-02-28T00:00:00.000 40.744149 -73.993721 104 3 91 1014966 1007990067 Hudson Yards-Chelsea-Flatiron-Union Square POINT (-73.99372 40.74415) 80.467786 36 061 009100 36061009100 91 Census Tract 91 G5020 S 179098.0 0.0 +40.7447178 -073.9951959 Census Tract 91, New York County, New York 6046 4145 4450 22.3 152946 3524 36 061 009100 118.0
14 1 CHELSEA 30 WAREHOUSES 4 794 9 E9 157 WEST 18TH STREET 10011 0 1 1 3,620 23800.0 1906 4 E9 23200000.0 2019-02-20T00:00:00.000 40.740526 -73.996575 104 3 87 1014713 1007940009 Hudson Yards-Chelsea-Flatiron-Union Square POINT (-73.99658 40.74053) 974.789916 36 061 008700 36061008700 87 Census Tract 87 G5020 S 165443.0 0.0 +40.7422439 -073.9969962 Census Tract 87, New York County, New York 6347 4208 4855 23.0 152774 3951 36 061 008700 631.0
sales_census = sales_census.dropna()
len(sales_census)
162879

Part 3: SKlearn Train a Random Forest on Property Sales

Then, We get all the data we need to train a predict model, we can start training right now!

Code
# Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Model selection
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

# Pipelines
from sklearn.pipeline import make_pipeline

# Preprocessing
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

3.1 set a pipeline include all variables

Code
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Numerical columns
num_cols = [
    "Total_Population",
    "Total Housing Units",
    "In Labor Force",
    "Travel_Time",
    "Median_HH_Income",
    "Total_HH",
    "ofns_desc",
]

# Categorical columns
cat_cols = [
    "building_class_category", 
]
# Set up the column transformer with two transformers
# ----> Scale the numerical columns
# ----> One-hot encode the categorical columns

transformer = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)
# Initialize the pipeline
# NOTE: only use 20 estimators here so it will run in a reasonable time
pipe = make_pipeline(
    transformer, RandomForestRegressor(n_estimators=20, random_state=42)
)

3.2 Split 30% & 70%

# Split the data 70/30
train_set, test_set = train_test_split(sales_census, 
                                       test_size=0.3, 
                                       random_state=42)

# the target labels: log of sale price
y_train = np.log(train_set["unit_sale_price"])
y_test = np.log(test_set["unit_sale_price"])
x_train = train_set[
   [ "Total_Population",
    "Total Housing Units",
    "In Labor Force",
    "Travel_Time",
    "Median_HH_Income",
    "Total_HH",
    "ofns_desc",
    "building_class_category",]
]
x_test = test_set[
   ["Total_Population",
    "Total Housing Units",
    "In Labor Force",
    "Travel_Time",
    "Median_HH_Income",
    "Total_HH",
    "ofns_desc",
    "building_class_category",]
]
# Fit the training set
pipe.fit(train_set, y_train);
# What's the test score?
pipe.score(test_set, y_test)
0.4906609622139816

3.3 Use GridSearchCV to perform a k-fold cross validation that optimize at least 2 hyperparameters of the RandomForestRegressor.

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# Create our regression pipeline
pipe2 = Pipeline(steps=[('preprocessor', transformer),
                              ('regressor', RandomForestRegressor())])
pipe2
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['Total_Population',
                                                   'Total Housing Units',
                                                   'In Labor Force',
                                                   'Travel_Time',
                                                   'Median_HH_Income',
                                                   'Total_HH', 'ofns_desc']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['building_class_category'])])),
                ('regressor', RandomForestRegressor())])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
pipe2.named_steps
{'preprocessor': ColumnTransformer(transformers=[('num', StandardScaler(),
                                  ['Total_Population', 'Total Housing Units',
                                   'In Labor Force', 'Travel_Time',
                                   'Median_HH_Income', 'Total_HH',
                                   'ofns_desc']),
                                 ('cat', OneHotEncoder(handle_unknown='ignore'),
                                  ['building_class_category'])]),
 'regressor': RandomForestRegressor()}
param_grid = {
    'regressor__n_estimators': [100, 200, 300],  
    'regressor__max_depth': [None, 10, 20, 30],  
    
}
param_grid
{'regressor__n_estimators': [100, 200, 300],
 'regressor__max_depth': [None, 10, 20, 30]}
grid_search = GridSearchCV(pipe2, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1, verbose=2)
grid_search.fit(x_train, y_train)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         StandardScaler(),
                                                                         ['Total_Population',
                                                                          'Total '
                                                                          'Housing '
                                                                          'Units',
                                                                          'In '
                                                                          'Labor '
                                                                          'Force',
                                                                          'Travel_Time',
                                                                          'Median_HH_Income',
                                                                          'Total_HH',
                                                                          'ofns_desc']),
                                                                        ('cat',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['building_class_category'])])),
                                       ('regressor', RandomForestRegressor())]),
             n_jobs=-1,
             param_grid={'regressor__max_depth': [None, 10, 20, 30],
                         'regressor__n_estimators': [100, 200, 300]},
             scoring='neg_mean_squared_error', verbose=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
grid_search.best_estimator_
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(),
                                                  ['Total_Population',
                                                   'Total Housing Units',
                                                   'In Labor Force',
                                                   'Travel_Time',
                                                   'Median_HH_Income',
                                                   'Total_HH', 'ofns_desc']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['building_class_category'])])),
                ('regressor',
                 RandomForestRegressor(max_depth=30, n_estimators=300))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
grid_search.best_params_
{'regressor__max_depth': 30, 'regressor__n_estimators': 300}
def evaluate_mape(model, X_test, y_test):
    """
    Given a model and test features/targets, print out the 
    mean absolute error and accuracy
    """
    # Make the predictions
    predictions = model.predict(X_test)

    # Absolute error
    errors = abs(predictions - y_test)
    avg_error = np.mean(errors)

    # Mean absolute percentage error
    mape = 100 * np.mean(errors / y_test)

    # Accuracy
    accuracy = 100 - mape

    print("Model Performance")
    print(f"Average Absolute Error: {avg_error:0.4f}")
    print(f"Accuracy = {accuracy:0.2f}%.")

    return accuracy
# Initialize the pipeline
base_model = make_pipeline(
    transformer, RandomForestRegressor(n_estimators=20, random_state=42)
)

# Fit the training set
base_model.fit(x_train, y_train)

# Evaluate on the test set
base_accuracy = evaluate_mape(base_model, x_test, y_test)# Initialize the pipeline
Model Performance
Average Absolute Error: 0.3403
Accuracy = 93.23%.
# Initialize the pipeline
best_model = grid_search.best_estimator_

# Fit the training set
best_model.fit(x_train, y_train)

# Evaluate on the test set
best_accuracy = evaluate_mape(best_model, x_test, y_test)
Model Performance
Average Absolute Error: 0.3354
Accuracy = 93.31%.

Errors:

we first test the model on the 0.3 split of 2020 data and received a test score of 0.49. This score is good enough for us to proceed with this model. For the model on historical data of 2020, we achieved an accuracy of 93.23%; whereas for the trained model, we achieved an accuracy of 93.31%. We then calculated the percentage error for each census tract and created a choropleth map to see the spatial distribution of errors. It can be seen that the error level is consistently low across most of the areas, except two or three areas as outliners. This means that our model is relatively spatially consistent.

3.4 Make a data frame with percent errors and census tract info for each sale in the test set

Create a data frame that has the property geometries, census tract data, and percent errors for all of the sales in the test set.

predictions = best_model.predict(x_test)

# Absolute error
errors =  abs((y_test - predictions) / y_test) * 100
errors.head()
282469    0.116377
287738    7.274291
328679    0.732932
218207    3.038824
63011     3.530113
Name: unit_sale_price, dtype: float64
test_set['percent_error'] = errors
test_set.head()
borough neighborhood building_class_category tax_class_as_of_final_roll block lot building_class_as_of_final address zip_code residential_units commercial_units total_units land_square_feet gross_square_feet year_built tax_class_at_time_of_sale building_class_at_time_of sale_price sale_date latitude longitude community_board council_district census_tract bin bbl nta geometry unit_sale_price STATEFP COUNTYFP TRACTCE GEOID NAME_x NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON NAME_y Total_Population Total Housing Units In Labor Force Travel_Time Median_HH_Income Total_HH state county tract ofns_desc percent_error
282469 3 BEDFORD STUYVESANT 03 THREE FAMILY DWELLINGS 1 1807 82 C0 363 GATES AVE 11216 3 0 3 2,000 2880.0 1899 1 C0 1150000.0 2016-05-10T00:00:00.000 40.685994 -73.952625 303 36 243 3050859 3018070082 Bedford POINT (-73.95262 40.68599) 399.305556 36 047 024300 36047024300 243 Census Tract 243 G5020 S 154349.0 0.0 +40.6874555 -073.9528560 Census Tract 243, Kings County, New York 4214 2093 2647 43.2 61475 1883 36 047 024300 45.0 0.116377
287738 4 ROSEDALE 01 ONE FAMILY DWELLINGS 1 13750 176 A2 147-63 EDGEWOOD STREET 11422 1 0 1 5,200 1369.0 1955 1 A2 276500.0 2016-05-31T00:00:00.000 40.65567 -73.741928 413 31 664 4292017 4137500176 Rosedale POINT (-73.74193 40.65567) 201.972243 36 081 066401 36081066401 664.01 Census Tract 664.01 G5020 S 1112816.0 103375.0 +40.6471780 -073.7430504 Census Tract 664.01, Queens County, New York 1444 424 883 52.8 141250 378 36 081 066401 22.0 7.274291
328679 2 BAYCHESTER 01 ONE FAMILY DWELLINGS 1 4780 46 A5 3020 ELY AVENUE 10469 1 0 1 2,651 1260.0 1960 1 A5 320000.0 2016-11-10T00:00:00.000 40.871916 -73.836824 212 12 46202 2062294 2047800046 Co-op City POINT (-73.83682 40.87192) 253.968254 36 005 046208 36005046208 462.08 Census Tract 462.08 G5020 S 485072.0 0.0 +40.8750208 -073.8369671 Census Tract 462.08, Bronx County, New York 5586 1494 2748 58.0 73578 1460 36 005 046208 38.0 0.732932
218207 QUEENS FAR ROCKAWAY 02 TWO FAMILY DWELLINGS 1 15790 27 B3 414 BEACH 27 STREET 11691 2 0 2 3,600 1768.0 1920 1 B3 350000.0 2021-03-05T00:00:00.000 40.59785 -73.760714 414 31 99801 4301435 4157900027 Far Rockaway-Bayswater POINT (-73.76071 40.59785) 197.963801 36 081 099801 36081099801 998.01 Census Tract 998.01 G5020 S 456707.0 0.0 +40.5980835 -073.7575992 Census Tract 998.01, Queens County, New York 8631 2732 4917 55.3 66532 2634 36 081 099801 91.0 3.038824
63011 4 MASPETH 02 TWO FAMILY DWELLINGS 1 2779 44 B2 68-07 59TH DRIVE 11378 2 0 2 2,000 1836.0 1931 1 B2 965000.0 2019-09-25T00:00:00.000 40.721695 -73.893983 405 30 66701 4062280 4027790044 Middle Village POINT (-73.89398 40.72169) 525.599129 36 081 066701 36081066701 667.01 Census Tract 667.01 G5020 S 252633.0 0.0 +40.7218219 -073.8913365 Census Tract 667.01, Queens County, New York 3047 1055 1413 43.6 109390 994 36 081 066701 12.0 3.530113
median_errors = test_set.groupby('GEOID')['percent_error'].median().reset_index()

error_map = merged_ct.merge(median_errors, on='GEOID')

fig, ax = plt.subplots(1, 1, figsize=(10, 6))
error_map.plot(column='percent_error', ax=ax, legend=True,
               legend_kwds={'label': "Median Percent Error by Census Tract"},
               cmap='OrRd')  # Choose a colormap that fits your data and preference
ax.set_axis_off()